Project Goal (threefold): (1) transformation and cleaning of the data; (2) exploratory data analysis of the dating-site data; (3) machine-learning modelling.

Introduction

The objective of this study is to investigate the critical factors that contribute to an individual’s appeal, popularity, and recognition within an online dating platform. The data utilised for this research is sourced from Lovoo, a prominent European dating application, and is accessible via Kaggle.

The underlying motivation for this study stems from the desire to comprehend behavioural patterns that transcend the confines of physical attractiveness. The aim is to unveil hidden determinants that may shape interpersonal interactions within a digital dating platform. The behaviour exhibited on these platforms carries significance, even in economic contexts. By deciphering this behavioural paradigm, it can potentially contribute to the development of economic models. These enhanced models can subsequently offer a more profound analytic framework to elucidate overall mate-selection behaviour.

The initial phase of the analysis involves the transformation of raw data into a more interpretable format. This includes the creation of additional variables tailored to augment the predictive capacity of the statistical models employed in subsequent stages. This phase facilitates the exploratory aspect of the research, enabling an in-depth examination of data in search of potential predictor variables. The objective extends beyond understanding the phenomena; the aim is to anticipate which factors instigate an increased number of profile views and, subsequently, the ‘likes’ received.

The modelling process is a two-step approach. The first stage focuses on identifying variables that may elucidate why individuals view a certain profile. Potential variables include online presence, age, geographical location, and the timing of an individual’s online activity. The second stage aims to identify factors that influence the likelihood of a profile receiving ‘likes’. These may include the number of pictures on a profile, the characteristics of a profile’s biography, languages spoken, profile verification status, and mobile usage.

A decision tree model and a random forest model are employed in this study because of their ability to discern intricate characteristics that influence outcomes. Together they provide a comprehensive analysis approach: the simplicity and interpretability of decision trees combined with the robustness and higher accuracy of random forests, applied to the two distinct measures of user engagement on the dating platform. By training two models for each method, I ensure a more detailed understanding of the factors driving both profile views and likes.

Part 1: Transformation and Cleaning

The dataset in consideration comprises 3,973 observations and approximately 30 variables, each encapsulating specific attributes pertaining to individual profiles and related demographic information. An excerpt of the dataset is provided in Table 1, supplemented by Table 2, which describes a selection of significant variables. It is worth noting that the dataset solely encompasses data of individuals identifying as female. As such, the core objective of this analysis is to discern the determinants influencing the behavioural patterns of individuals displaying interest in females.

Table 1: Head of dataframe.

genderLooking age counts_details counts_pictures counts_profileVisits counts_kisses flirtInterests_chat verified lang_count lang_de whazzup
M 25 1.00 4 8279 239 TRUE 0 1 TRUE Nur tote fische schwimmen mit dem strom
M 22 0.85 5 663 13 TRUE 0 3 TRUE Primaveraaa<3
M 21 0.00 4 1369 88 FALSE 0 0 FALSE NA
none 20 0.12 3 22187 1015 TRUE 0 2 FALSE Je pense donc je suis. Instagram quedev
M 21 0.15 12 35262 1413 TRUE 0 1 TRUE Instagram: JESSSIESCH

Table 2: Description of variables in data set.

Variable Description
genderLooking Preferred gender the subject is looking to engage with. Represented as ‘M’ for male, ‘F’ for female, ‘both’ for male and female, or ‘none’.
age Age of the individual.
counts_details How complete the profile is: the proportion of detail filled in on the account, measured from 0.0 to 1.0.
counts_pictures Number of pictures the profile contains.
counts_profileVisits How many times the profile has been viewed.
counts_kisses Number of ‘kisses’ or ‘likes’ received by profile.
flirtInterests_* What the individual is interested in. ’*’ represents: ‘chat’, ‘date’, ‘friends’.
verified Whether the profile has been verified or not.
lang_count Number of languages spoken by an individual.
lang_* Language spoken by an individual. ’*’ represents: ‘en’ (English), ‘de’ (German), ‘fr’ (French), ‘it’ (Italian), ‘es’ (Spanish).
whazzup A phrase that represents the profile’s ‘bio’.

The original dataset is already quite usable, but we can produce better models by adding some new variables. The first step is to take a closer look at the language people use in their profiles. I focus on two main things here: the words used in the profile descriptions, and the use of emojis. Both could give insight into a person’s confidence and desirability.

I created two new dummy variables, has_emoji and contains_popular_word. has_emoji takes the value ‘1’ if whazzup contains an emoji; contains_popular_word takes the value ‘1’ if whazzup contains one of the most popular words. The code chunk also outputs the most popular words as a word cloud. (The word cloud is a dynamic image that shows a word’s popularity when hovering over it.)
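The feature construction can be sketched as follows. This is a minimal illustration, not the exact chunk used: the data frame name `lovoo` and the popular-word list are assumptions, and the emoji regex covers only the common Unicode emoji blocks.

```r
library(dplyr)
library(stringr)

# Illustrative placeholder list; the real list came from the word-frequency count
popular_words <- c("instagram", "snap", "love")

lovoo <- lovoo %>%
  mutate(
    # flag bios containing a character in the common emoji Unicode blocks
    has_emoji = as.integer(
      str_detect(whazzup, "[\\x{1F300}-\\x{1FAFF}\\x{2600}-\\x{27BF}]")
    ),
    # flag bios containing any of the popular words (case-insensitive)
    contains_popular_word = as.integer(
      str_detect(str_to_lower(whazzup), str_c(popular_words, collapse = "|"))
    )
  ) %>%
  # bios that are NA cannot contain an emoji or a popular word
  mutate(across(c(has_emoji, contains_popular_word),
                ~ replace(.x, is.na(.x), 0L)))
```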

Part 2: Exploratory Data Analysis

This segment aims to identify underlying patterns and relationships within the dataset. An initial step involves visually inspecting the variables, helping to assess their potential relevance and impact on the outcomes of interest. As a fundamental part of exploratory data analysis, these visual inspections allow one to discern which features could be instrumental in shaping predictive models.

As hinted in the introductory section, it quickly becomes apparent that specific variables have a more pronounced influence on the number of profile ‘Likes’, while others may largely dictate the number of ‘Profile Views’. This distinction is crucial, as certain profile elements only become observable once a profile is viewed. For instance, the information in a profile biography only comes into play during a profile view. Therefore, the dynamics of what draws views and subsequently encourages likes may differ significantly, although both are important aspects of profile engagement.

Interestingly, despite these differences, one notices a robust correlation between profile views and likes. This interplay implies that a successful profile is not just about attracting views but also about converting those views into likes. Figure 2 visually represents this relationship, further illuminating the interdependent nature of profile views and likes. Uncovering these patterns provides essential insights that can inform our subsequent modeling efforts.

Figure 2: Bubble plot of profile views and number of pictures in profile. A non-linear model (loess method) was fitted on the plot to discern possible patterns and differences between bios with emojis and those without. The size of the dots indicates how detailed the account is.

Biography characteristics and popularity

Figure 3 below aims to present whether there is a difference in the distribution of likes received based on the newly created dummy variables, has_emoji, contains_popular_word, and night_owl (whether the user is active during nighttime hours). There seem to be some slight differences in likes received, supporting the idea that the use of emojis and certain words signals higher levels of trust. Being online at night may also increase profile views, but I treat this variable as a control rather than a causal one, since more people tend to be online at night than during the day.

Figure 3: Violin plots showing effects of profile characteristics on popularity. Left panel shows effect of a bio containing social media particulars and/or an emoji on likes received. Right panel shows effect of an online profile and/or being a night owl on number of profile visits.

In terms of the structure of an individual’s bio, I have utilized bio length as a proxy for word complexity, with the hypothesis that longer bios may reflect a higher degree of linguistic complexity. The underlying assumption is that users who write longer bios may use a wider range of vocabulary and complex sentence structures, reflecting their capacity to express intricate thoughts or feelings. However, it’s important to note that length does not necessarily equate to complexity — short bios can also be highly nuanced and complex while long bios might be repetitive or simplistic. Hence, although bio length provides a starting point for analysis, more sophisticated measures of textual complexity could be desirable for a more comprehensive understanding.
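The proxy itself is simple to construct. A minimal sketch, again assuming the data frame is named `lovoo`; the standardised value is what the scatterplot in figure 4 would use:

```r
# Bio length as a crude complexity proxy; NA bios count as length 0
lovoo$bio_length <- nchar(ifelse(is.na(lovoo$whazzup), "", lovoo$whazzup))

# Standardise (z-score) so the scale is comparable across features
lovoo$bio_length_std <- as.numeric(scale(lovoo$bio_length))
```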

The scatterplot in the right panel of figure 4 suggests there is no clear linear relationship between the standardized complexity of bios and the count of kisses received, indicating that a more complex bio does not necessarily attract more interactions. Although the presence of emojis differentiates two clusters, it doesn’t appear to be a strong influence on the count of likes either. This analysis challenges initial assumptions about what factors might drive interaction. However, it’s possible that other variables not considered here, such as user activity level or profile picture, could be more impactful. While the plot offers initial insights, it also points towards the need for a more comprehensive exploration of factors influencing user interaction.

Figure 4: Visualisation of bio complexity. Left panel shows distribution of length of bio. Right panel shows a scatter plot between bio length and number of likes received; a linear model was fitted to scrutinise any possible relationship between the variables.

Geographical characteristics and popularity

Utilising the Google Maps API, I successfully geocoded the locations of all profiles present in the dataset. The primary objective behind this was to explore and visualise the potential impact of geographical location on profile views. The role of location might be significant, considering how geographical and cultural aspects can influence user interactions and preferences on the platform. The code snippet below shows the process to perform the geocoding operation.
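A sketch of that geocoding step using the ggmap package's Google Maps backend. The column name `city` and the environment-variable key are assumptions; only unique locations are geocoded to stay within API quotas.

```r
library(ggmap)

# API key assumed to be stored in the environment, never hard-coded
register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))

# geocode unique locations once, then join the coordinates back onto the data
locations <- unique(lovoo$city)
coords <- geocode(locations, output = "latlon")
coords$city <- locations

lovoo <- merge(lovoo, coords, by = "city", all.x = TRUE)
```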

The subsequent bubble plot illustrates a few disparities among cities. However, these contrasts are not significant enough to confirm any clear geographical trends in profile views. Thus, it is not feasible to definitively say that some regions show more inclination towards profile views than others based on this representation.

To gain a more insightful understanding, a choropleth map is utilized. This geographical representation not only gives a visual interpretation of data but also enhances comprehension through color-coding. Upon implementing this, it becomes noticeably clear that certain countries indeed experience higher profile views on average.

In particular, profiles originating from Spain, Hungary, and the Netherlands tend to attract more attention compared to other European countries. The reasons behind these trends can be many: cultural nuances, user behaviours, or the presence of more active users in these regions. Future investigation might delve deeper into these aspects to provide more concrete explanations for the observed patterns.

Figure 4: Geographic data visualisation of profile views. Size and colour of bubbles in the left panel indicate profile views. Colour of country in the right panel indicates profile views.

When the data is visualized on a map, one notices that profiles from certain countries tend to get more views. But there’s more to the story than geography.

I produced a lollipop chart (figure 5 below) to show the number of users in each region, with the colour of the lollipop indicating mean profile views. What we see is interesting: a country’s number of users did not necessarily match up with its mean profile views. This discrepancy can be chalked up to sample-size bias. Simply put, countries with fewer users can show a higher mean number of views, because a few very popular individuals push up the average.

As it turns out, using a profile’s country of origin to predict its popularity might be misleading. To keep the final model as accurate as possible, this variable was left out of the mix.

Figure 5: Lollipop chart of number of users by country. The colour of the lollipops indicates mean profile views.

Other profile characteristics and popularity

In this sub-section, the objective is to ascertain the impact of various profile attributes on the degree of popularity experienced on the dating application. The attributes under scrutiny include the number of pictures a profile has, its verification status, whether it can be shared, and the expressed interests of the profile owner, among others.

In order to illuminate the relationships between these variables and the response - profile likes - a correlogram has been produced, which reveals some notable insights. For instance, a slight negative correlation is observed between profile likes and factors such as age, interests leaning towards ‘just friends’, and shareability of the profile. On the contrary, having a verified status and showcasing multilingual abilities are positively correlated with profile likes, signifying their potential influence in enhancing a profile’s appeal.

Figure 6: Correlogram of profile characteristics and number of likes received.

The significance of language as a determinant of popularity was also explored in this analysis. This was reflected in Figure 6, where the number of languages spoken was considered as a potential predictor of popularity. Subsequently, Figure 7 provides a visualisation of the distribution of received likes in relation to specific languages spoken by the profiles.

Despite these considerations, the investigation does not reveal a discernible difference in the distribution of profile likes contingent on the languages spoken. The absence of any substantial differentiation in this context suggested that the language factor may not hold significant sway over profile popularity. Consequently, the language variable was not included in the formulation of the final predictive models.

Figure 7: Ridgeline plot of languages spoken and number of likes received. Dashed line shows overall mean profile likes.

The perceived attractiveness of a profile is often regarded as a significant determinant of mate searching behaviour. However, the dataset at hand does not include any direct measures of perceived attractiveness. Nevertheless, we have access to a proxy for this attribute, namely the number of pictures present in a profile. While it may not be the most accurate representation of attractiveness, it offers some insight into the visual appeal of a profile.

In conjunction with this, the presence of social media tags on a profile was also examined, given that these tags may serve as additional indicators of social validation or popularity.

Upon examining Figure 8, we observe a correlation between the number of pictures in a profile and the number of profile likes. Specifically, among profiles with few pictures, those featuring social media tags tend to receive more likes than comparable profiles without them. As the number of pictures increases, the distinction between profiles with and without social media tags becomes less apparent.

This implies that while social media tags can enhance the visibility of a profile, their impact diminishes as the number of pictures increases. Thus, the number of pictures in a profile, serving as a rudimentary indicator of attractiveness, can also influence the popularity of a profile to some degree.

Figure 8: Dotted line plot of the number of pictures in profile and likes received. The mean number of likes received by number of photos was used to plot this relationship. Lines split based on social media tag presence in profile.

Part 3: Modelling & Results

After data preparation and exploration, I proceeded with modelling. As a part of the modelling approach for this analysis, I have elected to implement two popular machine learning techniques: decision trees and random forests. These methods were chosen due to their interpretability, effectiveness in handling complex datasets, and their capacity for both classification and regression tasks.

In each of these chosen techniques, two separate models were trained to serve distinct predictive purposes. The first model targets the prediction of profile views, while the second model aims at forecasting profile likes. This dual-model approach was adopted in recognition of the distinct factors that could potentially influence these two different measures of user engagement. Each model is trained on a different set of predictor variables, carefully chosen based on the insights gathered during the data exploration phase.

The first step involves partitioning the data into training and testing subsets. For this analysis, I adopted the widely used 70/30 split, whereby 70% of the data forms the training set and the remaining 30% is reserved for testing. This allocation strikes a balance: ample data to train the model effectively, whilst retaining a substantial portion for assessing the model’s performance on unseen data. The code demonstrated below provides the method I employed to execute this data split. Moreover, I undertook this process twice, resulting in two distinct sets - one for profile views and another for profile likes - enabling a targeted examination of each aspect of profile engagement.
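The split can be sketched with caret's createDataPartition, which keeps the outcome's class proportions balanced across the subsets. The seed value is an arbitrary assumption; the set names match those that appear in the model output later.

```r
library(caret)

set.seed(123)  # assumed seed, for reproducibility only

# 70/30 split for the profile-views outcome
idx_visits <- createDataPartition(lovoo$Profile_Views, p = 0.7, list = FALSE)
training_visits <- lovoo[idx_visits, ]
testing_visits  <- lovoo[-idx_visits, ]

# repeated for the profile-likes outcome
idx_kisses <- createDataPartition(lovoo$Profile_Likes, p = 0.7, list = FALSE)
training_kisses <- lovoo[idx_kisses, ]
testing_kisses  <- lovoo[-idx_kisses, ]
```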

Decision Tree Model

Decision trees are a type of predictive modelling approach, so called because they produce a tree-like model of decisions. In this instance, two decision tree models were constructed: one for profile visits and another for profile likes.

For the profile visits model, three predictor variables were considered: ‘isOnline’ (whether the user is currently online), ‘night_owl’ (whether the user is active during nighttime hours), and ‘age’.

The profile likes model, on the other hand, was slightly more complex, considering a wider array of variables including ‘has_emoji’, ‘has_social’, ‘Profile_Views’, ‘counts_pictures’, ‘lang_count’, ‘flirtInterests_chat’, ‘flirtInterests_date’, ‘flirtInterests_friends’, and ‘counts_details’. These variables were deemed to be potentially relevant to the number of likes a profile receives.

Once each decision tree model was trained, the ‘summary’ function was invoked to provide a comprehensive view of the models’ characteristics. It includes details such as variable importance, split points, and node summary, providing valuable insights into the models’ decision-making process.
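The fitting step can be sketched as follows; the two formulas are reproduced from the rpart calls echoed in the output.

```r
library(rpart)

# Profile-views tree: three predictors
tree_visits <- rpart(Profile_Views ~ isOnline + night_owl + age,
                     data = training_visits, method = "class")

# Profile-likes tree: the wider predictor set
tree_kisses <- rpart(Profile_Likes ~ has_emoji + has_social + Profile_Views +
                       counts_pictures + lang_count + flirtInterests_chat +
                       flirtInterests_date + flirtInterests_friends +
                       counts_details,
                     data = training_kisses, method = "class")

summary(tree_visits)
summary(tree_kisses)
```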

## Call:
## rpart(formula = Profile_Views ~ isOnline + night_owl + age, data = training_visits, 
##     method = "class")
##   n= 2780 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.09352518      0 1.0000000 1.0508393 0.01033357
## 2 0.01000000      1 0.9064748 0.9064748 0.01179770
## 
## Variable importance
##  isOnline night_owl       age 
##        77        14         8 
## 
## Node number 1: 2780 observations,    complexity param=0.09352518
##   predicted class=Low   expected loss=0.75  P(node) =1
##     class counts:   695   695   695   695
##    probabilities: 0.250 0.250 0.250 0.250 
##   left son=2 (1658 obs) right son=3 (1122 obs)
##   Primary splits:
##       isOnline  < 0.5  to the right, improve=29.722160, (0 missing)
##       age       < 21.5 to the right, improve=11.880220, (0 missing)
##       night_owl < 0.5  to the left,  improve= 3.611883, (0 missing)
##   Surrogate splits:
##       night_owl < 0.5  to the left,  agree=0.671, adj=0.185, (0 split)
##       age       < 20.5 to the right, agree=0.640, adj=0.109, (0 split)
## 
## Node number 2: 1658 observations
##   predicted class=Low   expected loss=0.6930036  P(node) =0.5964029
##     class counts:   509   438   397   314
##    probabilities: 0.307 0.264 0.239 0.189 
## 
## Node number 3: 1122 observations
##   predicted class=High  expected loss=0.6604278  P(node) =0.4035971
##     class counts:   186   257   298   381
##    probabilities: 0.166 0.229 0.266 0.340
## Call:
## rpart(formula = Profile_Likes ~ has_emoji + has_social + Profile_Views + 
##     counts_pictures + lang_count + flirtInterests_chat + flirtInterests_date + 
##     flirtInterests_friends + counts_details, data = training_kisses, 
##     method = "class")
##   n= 2779 
## 
##          CP nsplit rel error    xerror       xstd
## 1 0.3288201      0 1.0000000 1.0000000 0.01112283
## 2 0.1571567      1 0.6711799 0.6711799 0.01274569
## 3 0.1542553      2 0.5140232 0.5164410 0.01239983
## 4 0.0100000      3 0.3597679 0.3597679 0.01128688
## 
## Variable importance
##          Profile_Views        counts_pictures         counts_details 
##                     61                     17                      9 
##              has_emoji             lang_count             has_social 
##                      6                      3                      2 
## flirtInterests_friends    flirtInterests_chat 
##                      1                      1 
## 
## Node number 1: 2779 observations,    complexity param=0.3288201
##   predicted class=Low       expected loss=0.7441526  P(node) =1
##     class counts:   711   686   689   693
##    probabilities: 0.256 0.247 0.248 0.249 
##   left son=2 (1406 obs) right son=3 (1373 obs)
##   Primary splits:
##       Profile_Views   splits as  LLRR,      improve=458.81820, (0 missing)
##       counts_pictures < 3.5   to the left,  improve=107.24340, (0 missing)
##       counts_details  < 0.02  to the left,  improve= 43.05716, (0 missing)
##       has_emoji       < 0.5   to the left,  improve= 22.35618, (0 missing)
##       has_social      < 0.5   to the left,  improve= 16.14522, (0 missing)
##   Surrogate splits:
##       counts_pictures < 3.5   to the left,  agree=0.677, adj=0.346, (0 split)
##       counts_details  < 0.75  to the left,  agree=0.596, adj=0.182, (0 split)
##       has_emoji       < 0.5   to the left,  agree=0.566, adj=0.122, (0 split)
##       lang_count      < 1.5   to the left,  agree=0.535, adj=0.059, (0 split)
##       has_social      < 0.5   to the left,  agree=0.534, adj=0.057, (0 split)
## 
## Node number 2: 1406 observations,    complexity param=0.1571567
##   predicted class=Low       expected loss=0.5007112  P(node) =0.5059374
##     class counts:   702   551   149     4
##    probabilities: 0.499 0.392 0.106 0.003 
##   left son=4 (692 obs) right son=5 (714 obs)
##   Primary splits:
##       Profile_Views   splits as  LR--,      improve=249.316200, (0 missing)
##       counts_pictures < 1.5   to the left,  improve= 32.799080, (0 missing)
##       counts_details  < 0.02  to the left,  improve= 14.828960, (0 missing)
##       lang_count      < 3.5   to the right, improve=  5.029280, (0 missing)
##       has_emoji       < 0.5   to the left,  improve=  3.568267, (0 missing)
##   Surrogate splits:
##       counts_pictures        < 1.5   to the left,  agree=0.624, adj=0.237, (0 split)
##       counts_details         < 0.06  to the left,  agree=0.571, adj=0.129, (0 split)
##       flirtInterests_friends < 0.5   to the left,  agree=0.546, adj=0.077, (0 split)
##       has_emoji              < 0.5   to the left,  agree=0.525, adj=0.035, (0 split)
##       flirtInterests_chat    < 0.5   to the left,  agree=0.520, adj=0.025, (0 split)
## 
## Node number 3: 1373 observations,    complexity param=0.1542553
##   predicted class=High      expected loss=0.4981792  P(node) =0.4940626
##     class counts:     9   135   540   689
##    probabilities: 0.007 0.098 0.393 0.502 
##   left son=6 (680 obs) right son=7 (693 obs)
##   Primary splits:
##       Profile_Views   splits as  --LR,      improve=244.991200, (0 missing)
##       counts_pictures < 9.5   to the left,  improve= 18.498690, (0 missing)
##       lang_count      < 1.5   to the left,  improve=  9.797405, (0 missing)
##       counts_details  < 0.785 to the left,  improve=  6.305873, (0 missing)
##       has_social      < 0.5   to the left,  improve=  5.557710, (0 missing)
##   Surrogate splits:
##       counts_pictures     < 6.5   to the left,  agree=0.613, adj=0.219, (0 split)
##       has_emoji           < 0.5   to the left,  agree=0.554, adj=0.099, (0 split)
##       counts_details      < 0.75  to the left,  agree=0.551, adj=0.093, (0 split)
##       lang_count          < 1.5   to the left,  agree=0.541, adj=0.074, (0 split)
##       flirtInterests_chat < 0.5   to the left,  agree=0.529, adj=0.049, (0 split)
## 
## Node number 4: 692 observations
##   predicted class=Low       expected loss=0.1604046  P(node) =0.2490104
##     class counts:   581   105     6     0
##    probabilities: 0.840 0.152 0.009 0.000 
## 
## Node number 5: 714 observations
##   predicted class=Low Mid   expected loss=0.3753501  P(node) =0.256927
##     class counts:   121   446   143     4
##    probabilities: 0.169 0.625 0.200 0.006 
## 
## Node number 6: 680 observations
##   predicted class=High Mid  expected loss=0.3691176  P(node) =0.2446923
##     class counts:     7   134   429   110
##    probabilities: 0.010 0.197 0.631 0.162 
## 
## Node number 7: 693 observations
##   predicted class=High      expected loss=0.1645022  P(node) =0.2493703
##     class counts:     2     1   111   579
##    probabilities: 0.003 0.001 0.160 0.835

The decision tree model for Profile Views was trained on a dataset encompassing 2,780 observations, using isOnline, night_owl, and age as predictor variables. The variable isOnline was deemed by far the most important, accounting for most of the total reduction of node impurity; night_owl and age followed. Only a single split was made (nsplit = 1), which reduced the relative error from 1.00 to 0.91. The primary split criterion was isOnline, with age and night_owl as the next-best candidates. Supplementary division rules, termed surrogate splits, were also established.

Conversely, the decision tree model for Profile Likes was developed using a wider array of predictor variables: has_emoji, has_social, Profile_Views, counts_pictures, lang_count, flirtInterests_chat, flirtInterests_date, flirtInterests_friends, and counts_details. This model revealed Profile_Views as the most significant variable, with counts_pictures and counts_details next in line. The model generated three splits, each progressively reducing the relative error. As with the first model, primary and surrogate splits were established, with the former centring around Profile_Views.

Each node in these decision tree models divulges key predictive details. To illustrate, Node 2 of the Profile Views tree houses 1,658 observations. The predicted category here is ‘Low’, with the anticipated misclassification rate (expected loss) around 0.69. The node also provides a distribution of the target variable categories in terms of probabilities. This step is consistently applied across all nodes and both decision tree models.

The following code chunk applies the model to the test set:
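A sketch of that step, assuming the fitted trees are stored as `tree_visits` and `tree_kisses` and the held-out sets as `testing_visits` and `testing_kisses`:

```r
library(caret)

# predict the class labels on the held-out data and tabulate against the truth
pred_visits <- predict(tree_visits, newdata = testing_visits, type = "class")
confusionMatrix(pred_visits, testing_visits$Profile_Views)

pred_kisses <- predict(tree_kisses, newdata = testing_kisses, type = "class")
confusionMatrix(pred_kisses, testing_kisses$Profile_Likes)
```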

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Low Mid High Mid High
##   Low      214     171      155  136
##   Low Mid    0       0        0    0
##   High Mid   0       0        0    0
##   High      84     127      143  162
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3154          
##                  95% CI : (0.2891, 0.3427)
##     No Information Rate : 0.25            
##     P-Value [Acc > NIR] : 0.0000002124    
##                                           
##                   Kappa : 0.0872          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.7181           0.00            0.00      0.5436
## Specificity              0.4832           1.00            1.00      0.6040
## Pos Pred Value           0.3166            NaN             NaN      0.3140
## Neg Pred Value           0.8372           0.75            0.75      0.7988
## Prevalence               0.2500           0.25            0.25      0.2500
## Detection Rate           0.1795           0.00            0.00      0.1359
## Detection Prevalence     0.5671           0.00            0.00      0.4329
## Balanced Accuracy        0.6007           0.50            0.50      0.5738
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Low Mid High Mid High
##   Low      253      48        0    0
##   Low Mid   48     174       57    0
##   High Mid   4      71      186   52
##   High       0       1       53  246
## 
## Overall Statistics
##                                                
##                Accuracy : 0.72                 
##                  95% CI : (0.6936, 0.7454)     
##     No Information Rate : 0.2557               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6267               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.8295         0.5918          0.6284      0.8255
## Specificity              0.9459         0.8832          0.8584      0.9397
## Pos Pred Value           0.8405         0.6237          0.5942      0.8200
## Neg Pred Value           0.9417         0.8687          0.8750      0.9418
## Prevalence               0.2557         0.2464          0.2481      0.2498
## Detection Rate           0.2121         0.1459          0.1559      0.2062
## Detection Prevalence     0.2523         0.2339          0.2624      0.2515
## Balanced Accuracy        0.8877         0.7375          0.7434      0.8826

The output presents the confusion matrices and statistics for the two decision tree models’ performance on the testing data.

The first model predicts four categories: Low, Low Mid, High Mid, and High. It only ever predicted Low or High, and never predicted Low Mid or High Mid. As a result, the model’s accuracy is low at 31.54%, with a 95% confidence interval between 28.91% and 34.27%. The kappa statistic is 0.0872, indicating poor agreement between the model’s predictions and the actual categories.

For each class, we can observe the following:

  1. Class Low: The model has a sensitivity, or true positive rate, of 71.81%, meaning it correctly identified 71.81% of the Low class instances. However, its positive predictive value (the proportion of true positives among the predicted positives) is just 31.66%, indicating a high false positive rate. The model has a balanced accuracy of 60.07% for this class, which accounts for both sensitivity and specificity and is an overall measure of its performance.

  2. The model didn’t predict Low Mid and High Mid at all, which explains the zero values in sensitivity and detection rate, and the NaN in positive predictive value.

  3. Class High: The model has a sensitivity of 54.36% and a positive predictive value of 31.40%, indicating that the model struggles to accurately identify and predict High class instances. The balanced accuracy is 57.38% for this class.

In the second model, the overall accuracy improves substantially to 71.67%, with a 95% confidence interval between 69.02% and 74.21%. The kappa statistic is 0.6222, indicating substantial agreement between the model’s predictions and the actual categories.

For each class, we can observe the following:

  1. Class Low: The model has a high sensitivity of 82.95% and a positive predictive value of 84.05%. The balanced accuracy is 88.77% for this class, suggesting a good performance in identifying and predicting Low class instances.

  2. Class Low Mid: The model has a moderate sensitivity of 59.18% and a positive predictive value of 62.37%. The balanced accuracy for this class is 73.75%.

  3. Class High Mid: The model’s performance decreases for this class, with a sensitivity of 62.84% and a positive predictive value of 59.42%. The balanced accuracy for this class is 74.34%.

  4. Class High: The model performs well with this class, with a sensitivity of 82.55% and a positive predictive value of 82.00%. The balanced accuracy is 88.26% for this class.

The second model outperforms the first one in predicting the test data, with a much higher accuracy and substantial agreement (kappa = 0.6222) between predictions and actual categories. However, there is room for improvement, particularly in predicting the Low Mid and High Mid classes.

Figure 9 below visualises the decision trees.

Figure 9: Results of decision tree models.

Random Forest Model

## 
## Call:
##  randomForest(formula = Profile_Views ~ isOnline + night_owl +      age + genderLooking, data = training_visits, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 68.35%
## Confusion matrix:
##          Low Low Mid High Mid High class.error
## Low      416      35       57  187   0.4014388
## Low Mid  342      40       54  259   0.9424460
## High Mid 298      43       54  300   0.9223022
## High     227      40       58  370   0.4676259
## 
## Call:
##  randomForest(formula = Profile_Likes ~ has_emoji + has_social +      Profile_Views + counts_pictures + lang_count + flirtInterests_chat +      flirtInterests_date + flirtInterests_friends + counts_details,      data = training_kisses, importance = TRUE, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 27.78%
## Confusion matrix:
##          Low Low Mid High Mid High class.error
## Low      583     119        7    2   0.1800281
## Low Mid  116     423      143    4   0.3833819
## High Mid   7     143      426  113   0.3817126
## High       0      10      108  575   0.1702742
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Low Mid High Mid High
##   Low      163     135      117   92
##   Low Mid   18       9       10   19
##   High Mid  32      28       26   27
##   High      84     126      145  160
## 
## Overall Statistics
##                                                
##                Accuracy : 0.3006               
##                  95% CI : (0.2746, 0.3275)     
##     No Information Rate : 0.2502               
##     P-Value [Acc > NIR] : 0.00004694           
##                                                
##                   Kappa : 0.0676               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.5488       0.030201         0.08725      0.5369
## Specificity              0.6152       0.947368         0.90258      0.6025
## Pos Pred Value           0.3215       0.160714         0.23009      0.3107
## Neg Pred Value           0.8041       0.745374         0.74768      0.7959
## Prevalence               0.2494       0.250210         0.25021      0.2502
## Detection Rate           0.1369       0.007557         0.02183      0.1343
## Detection Prevalence     0.4257       0.047019         0.09488      0.4324
## Balanced Accuracy        0.5820       0.488785         0.49491      0.5697
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Low Mid High Mid High
##   Low      255      47        1    0
##   Low Mid   44     167       57    3
##   High Mid   6      78      183   53
##   High       0       2       55  242
## 
## Overall Statistics
##                                                
##                Accuracy : 0.71                 
##                  95% CI : (0.6833, 0.7356)     
##     No Information Rate : 0.2557               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6133               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.8361         0.5680          0.6182      0.8121
## Specificity              0.9459         0.8843          0.8473      0.9363
## Pos Pred Value           0.8416         0.6162          0.5719      0.8094
## Neg Pred Value           0.9438         0.8623          0.8706      0.9374
## Prevalence               0.2557         0.2464          0.2481      0.2498
## Detection Rate           0.2137         0.1400          0.1534      0.2028
## Detection Prevalence     0.2540         0.2272          0.2682      0.2506
## Balanced Accuracy        0.8910         0.7262          0.7328      0.8742

The outcomes obtained from the two separate random forest models, presented above, may be interpreted as follows:

Profile Views Random Forest Model

This model aimed to predict ‘Profile_Views’ utilizing the features: ‘isOnline’, ‘night_owl’, ‘age’, and ‘genderLooking’. The training process involved 500 decision trees, with each split in the tree considering 2 variables.

An Out-of-Bag (OOB) error estimate, a commonly used internal measure of the accuracy of random forest models, was computed to be 68.35%. This indicates that the model misclassified about 68.35% of the observations when each was evaluated by the trees that did not see it during bootstrap sampling.
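The report fits this model with R’s randomForest package; purely as an illustrative analogue, the same OOB machinery is available in scikit-learn. The sketch below uses synthetic four-class data standing in for the four profile-view predictors (all parameters here are hypothetical, not the study’s):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the four predictors and four view categories.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=42)

# With oob_score=True, each observation is scored only by the trees whose
# bootstrap sample excluded it; 1 - oob_score_ is the OOB error estimate.
rf = RandomForestClassifier(n_estimators=500, max_features=2,
                            oob_score=True, random_state=42)
rf.fit(X, y)
oob_error = 1 - rf.oob_score_
```

`n_estimators=500` and `max_features=2` mirror the 500 trees and 2 variables per split reported in the R output above.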

An examination of the confusion matrix reveals varying rates of accuracy across the different classes. The model is most accurate for the ‘Low’ category (class error rate of approximately 40.14%), followed by ‘High’ (46.76%). Conversely, the ‘Low Mid’ and ‘High Mid’ categories show substantial misclassification, with class error rates above 92%.

Profile Likes Random Forest Model

The second model sought to predict ‘Profile_Likes’ based on a range of features including ‘has_emoji’, ‘has_social’, ‘Profile_Views’, ‘counts_pictures’, ‘lang_count’, ‘flirtInterests_chat’, ‘flirtInterests_date’, ‘flirtInterests_friends’, and ‘counts_details’. Similar to the first model, this one was also trained using 500 trees. However, at each split, this model considered 3 variables.

The OOB error rate for the second model is substantially lower at 27.78%, suggesting a better fit to the data as compared to the first model.

Upon analyzing the confusion matrix, it can be observed that the model demonstrated reasonable accuracy for the ‘Low’ and ‘High’ classes, with class error rates of 18.00% and 17.03% respectively. Nonetheless, the model encountered challenges with the ‘Low Mid’ and ‘High Mid’ categories, where the class error rates were 38.34% and 38.17% respectively.

Results from Testing Data

The profile views model demonstrates an overall accuracy rate of 30.06%, barely above the no-information rate of 25.02%. The sensitivity, or true positive rate, varies considerably across classes, with the highest rate (54.88%) observed for the ‘Low’ category and the lowest rate (3.02%) for the ‘Low Mid’ category. The specificity, or true negative rate, also varies, ranging from 94.74% for the ‘Low Mid’ category to 60.25% for the ‘High’ category. These variations suggest differential model performance across classes.

The profile likes outcome reveals a more satisfactory accuracy rate of 71.00%, with a kappa of 0.6133. In this case, both sensitivity and specificity are more evenly distributed across the classes, implying more consistent model performance.

In conclusion, the analysis suggests that the second model is more accurate and robust in making predictions compared to the first. It is also important to note that both models show varying performance levels when applied to different classes, which could be due to distinct characteristics within each class that the models capture with varying degrees of success.

Figure 10 below shows the importance of each variable in terms of predictive power for the profile likes random forest model.

Figure 10: Importance of variables of random forest model.
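Variable-importance scores of the kind plotted in Figure 10 are exposed directly by most implementations; in R they come from randomForest::importance() and varImpPlot(). As a rough scikit-learn analogue on hypothetical data (none of these feature counts correspond to the study’s variables):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: six candidate predictors, three of them informative.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=1)
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Mean-decrease-in-impurity importances: one non-negative score per
# feature, normalised to sum to 1.
importances = rf.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: -pair[1])
```

Ranking features this way is what underpins a plot like Figure 10: the informative predictors should dominate the top of the list.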

Discussion & Conclusion

The decision tree model demonstrates a level of efficacy; however, it doesn’t fully capture the intricate relationships within the data. It relies solely on one predictor, ‘profile views,’ to formulate its predictions. While ‘profile views’ may be a critical factor, ignoring other variables potentially diminishes the model’s performance.

In comparing the decision tree and random forest models, several key points emerge that offer insights into their relative strengths and weaknesses in this particular application.

Model Complexity and Understanding

The decision tree model has the advantage of being relatively simple to understand and interpret. Each decision within the tree corresponds to a question about one of the variables, making it a model that’s easy to visualize and explain. However, this simplicity can also be a limitation as it may not capture complex interactions among variables. This may explain its less-than-satisfactory performance on certain metrics, like sensitivity and specificity across various classes, and overall accuracy.
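This interpretability is concrete: a fitted tree can be dumped as plain if-then rules. The report’s trees were fitted in R, so the following is only an illustrative scikit-learn sketch on the built-in iris data, not the study’s model:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Every split prints as a plain question about a single feature,
# which is exactly what makes a single tree easy to explain.
rules = export_text(tree, feature_names=["sepal_len", "sepal_wid",
                                         "petal_len", "petal_wid"])
print(rules)
```

No comparable one-page summary exists for a forest of 500 such trees, which is the interpretability cost discussed next.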

On the other hand, the random forest model, which operates by creating a multitude of decision trees and aggregating their results, is capable of capturing more complex patterns and interactions in the data. However, the trade-off is that it’s more challenging to interpret, as it essentially involves a multitude of decision processes rather than just one.

Performance

The decision tree model’s overall performance was modest at best, especially when compared to the random forest models. Both random forest models demonstrated significantly better performance in terms of overall accuracy and class-specific metrics such as sensitivity and specificity. It is worth noting, however, that even the random forest models had substantial differences in performance, likely due to the different variables included in each model and the number of variables tried at each split.

The decision tree’s performance was notably weak when trying to predict certain classes (‘Low Mid’ and ‘High Mid’), indicating that it struggled with differentiating among these classes. This suggests that a single decision tree might not have enough flexibility to capture the nuances of this particular dataset.

Robustness

Random forest models are known to be less prone to overfitting compared to decision tree models. This is because they average the results of many different trees, each of which is trained on a slightly different subset of the data. This difference in robustness is likely a contributing factor to the better performance of the random forest models on the test data.
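The variance-reduction argument can be demonstrated on synthetic data: an unpruned tree memorises its training sample, while bagging many bootstrapped trees smooths that out. A hedged scikit-learn sketch (hypothetical data and parameters, not the study’s models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data with 5 informative features out of 10.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A fully grown single tree fits the training sample very closely...
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# ...whereas averaging many bootstrapped trees reduces that variance.
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

tree_acc = tree.score(X_te, y_te)      # held-out accuracy, single tree
forest_acc = forest.score(X_te, y_te)  # held-out accuracy, ensemble
```

On noisy data like this, the ensemble typically generalises at least as well as the single tree, which is the mechanism behind the robustness claim above.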

Computational Complexity

From a computational perspective, the decision tree model is less resource-intensive, making it a more suitable choice for datasets with a large number of variables or instances. Random forests, however, can require significant computational resources, especially as the number of trees increases.

In conclusion, while the decision tree model might be more easily interpretable and computationally efficient, the random forest is better equipped to capture complex interactions and to resist overfitting; in this study the two approaches reached comparable test accuracy on each task, with both performing far better on profile likes than on profile views. It’s a reminder that there’s always a trade-off between interpretability and predictive performance, and the best model depends on the specific context and the requirements of the analysis.